# Set seed
set.seed(7)
# Importing data
loanData <- read_csv("loan_sample_10.csv")
## Rows: 40000 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): grade, home_ownership, verification_status, purpose, application_type
## dbl (12): loan_amnt, int_rate, annual_inc, dti, open_acc, revol_bal, revol_u...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data <- loanData
# Checking structure of the data
head(data)
## # A tibble: 6 × 17
## loan_amnt int_rate grade home_ownership annual_inc verification_status purpose
## <dbl> <dbl> <chr> <chr> <dbl> <chr> <chr>
## 1 20000 11.5 B MORTGAGE 110000 Source Verified credit…
## 2 21000 6.97 A MORTGAGE 95000 Not Verified debt_c…
## 3 16000 11.0 B MORTGAGE 104000 Source Verified debt_c…
## 4 8400 6.92 A MORTGAGE 48000 Not Verified debt_c…
## 5 24000 14.5 C RENT 110000 Not Verified debt_c…
## 6 9700 18.5 D OWN 37000 Verified debt_c…
## # ℹ 10 more variables: dti <dbl>, open_acc <dbl>, revol_bal <dbl>,
## # revol_util <dbl>, total_acc <dbl>, total_rec_int <dbl>,
## # application_type <chr>, tot_cur_bal <dbl>, total_rev_hi_lim <dbl>,
## # Status <dbl>
tail(data)
## # A tibble: 6 × 17
## loan_amnt int_rate grade home_ownership annual_inc verification_status purpose
## <dbl> <dbl> <chr> <chr> <dbl> <chr> <chr>
## 1 13325 10.2 B MORTGAGE 32240 Not Verified debt_c…
## 2 6000 11.0 B RENT 50000 Not Verified credit…
## 3 16000 12.7 C OWN 68000 Source Verified credit…
## 4 14400 14.5 C MORTGAGE 61000 Source Verified debt_c…
## 5 9600 10.4 B RENT 30000 Not Verified debt_c…
## 6 9000 9.93 B RENT 95000 Not Verified debt_c…
## # ℹ 10 more variables: dti <dbl>, open_acc <dbl>, revol_bal <dbl>,
## # revol_util <dbl>, total_acc <dbl>, total_rec_int <dbl>,
## # application_type <chr>, tot_cur_bal <dbl>, total_rev_hi_lim <dbl>,
## # Status <dbl>
The head() and tail() functions show, that
there is no unusual observation and the data set looks complete and has
no emtpy rows at the end.
# Define colors to create a general style
custom_fill <- "#105F77"
custom_alpha <- 0.8
custom_color <- "#161616"
custom_light <- "#f4a460"
custom_dark <- "#105F77"
# Check for missing values in each column of the data set and store the result in the "has_NAs" data frame
has_NAs = as.data.frame(apply(data, 2, function(x) any(is.na(x))))
# Map matching column names for a more readable output
colnames(has_NAs) = c("has_NA")
# Print the has_NAs data frame to check if there are any NA values in our data set
has_NAs
## has_NA
## loan_amnt FALSE
## int_rate FALSE
## grade FALSE
## home_ownership FALSE
## annual_inc FALSE
## verification_status FALSE
## purpose FALSE
## dti FALSE
## open_acc FALSE
## revol_bal FALSE
## revol_util FALSE
## total_acc FALSE
## total_rec_int FALSE
## application_type FALSE
## tot_cur_bal FALSE
## total_rev_hi_lim FALSE
## Status FALSE
We also found out, that none of the variables has any NA
values.
This list describes the variables of the dataset loan_data:
annual_inc: The self-reported annual income provided by the borrower
during registration
application_type: Indicates whether the loan is an individual or joint
application
dti: Borrower’s total monthly debt payments divided by monthly
income
grade: Assigned loan grade by the financial service provider
home_ownership: The home ownership status
int_rate: Interest Rate on the loan
loan_amnt: The listed amount of the loan applied for by the
borrower
open_acc: Number of open trades in last 6 months
purpose: A category provided by the borrower for the loan request
revol_bal: Total credit revolving balance
revol_util: Revolving line utilization rate
tot_cur_bal: Total current balance of all accounts
total_acc: The total number of credit lines currently in the borrower’s
credit file
total_rec_int: Interest received to date
total_rev_hi_lim: Total revolving high credit/credit limit
verification_status: Indicates if the co-borrowers’ joint income was
verified
# First, we check the dimension of the dataset
dim(data)
## [1] 40000 17
With the dim function we see, that there are 40’000
observations within 17 variables.
# Next, we display the structure of the dataset, showing the type of each variable (numeric, categorical, etc.).
str(data)
## spc_tbl_ [40,000 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ loan_amnt : num [1:40000] 20000 21000 16000 8400 24000 9700 9000 5500 10000 12000 ...
## $ int_rate : num [1:40000] 11.53 6.97 10.99 6.92 14.49 ...
## $ grade : chr [1:40000] "B" "A" "B" "A" ...
## $ home_ownership : chr [1:40000] "MORTGAGE" "MORTGAGE" "MORTGAGE" "MORTGAGE" ...
## $ annual_inc : num [1:40000] 110000 95000 104000 48000 110000 37000 50000 28000 80000 42500 ...
## $ verification_status: chr [1:40000] "Source Verified" "Not Verified" "Source Verified" "Not Verified" ...
## $ purpose : chr [1:40000] "credit_card" "debt_consolidation" "debt_consolidation" "debt_consolidation" ...
## $ dti : num [1:40000] 24.68 11.15 9.66 20.9 18.3 ...
## $ open_acc : num [1:40000] 16 9 3 7 10 14 12 7 12 10 ...
## $ revol_bal : num [1:40000] 15299 17900 8945 2341 17641 ...
## $ revol_util : num [1:40000] 45.5 70.2 65.8 17.3 68.1 55.4 35 24.6 33.4 52.1 ...
## $ total_acc : num [1:40000] 32 19 22 18 31 16 19 21 25 27 ...
## $ total_rec_int : num [1:40000] 3721 1789 1308 920 3233 ...
## $ application_type : chr [1:40000] "Individual" "Individual" "Individual" "Individual" ...
## $ tot_cur_bal : num [1:40000] 346890 282222 22696 38145 62426 ...
## $ total_rev_hi_lim : num [1:40000] 33600 25500 13600 13500 25900 6900 27100 21500 33500 21800 ...
## $ Status : num [1:40000] 0 0 0 0 1 0 0 0 0 0 ...
## - attr(*, "spec")=
## .. cols(
## .. loan_amnt = col_double(),
## .. int_rate = col_double(),
## .. grade = col_character(),
## .. home_ownership = col_character(),
## .. annual_inc = col_double(),
## .. verification_status = col_character(),
## .. purpose = col_character(),
## .. dti = col_double(),
## .. open_acc = col_double(),
## .. revol_bal = col_double(),
## .. revol_util = col_double(),
## .. total_acc = col_double(),
## .. total_rec_int = col_double(),
## .. application_type = col_character(),
## .. tot_cur_bal = col_double(),
## .. total_rev_hi_lim = col_double(),
## .. Status = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
The str function shows us all the different variables
and what type they’re from.
num_vars <- sum(sapply(data, is.numeric))
cat_vars <- sum(sapply(data, is.character))
# Print counts
cat("Number of numeric variables:", num_vars, "\n")
## Number of numeric variables: 12
cat("Number of categorical variables:", cat_vars, "\n")
## Number of categorical variables: 5
Reporting the structure of the data set shows us, that we have 5 character columns and 12 number columns.
summary(data)
## loan_amnt int_rate grade home_ownership
## Min. : 1000 Min. : 5.31 Length:40000 Length:40000
## 1st Qu.: 7000 1st Qu.: 9.44 Class :character Class :character
## Median :10000 Median :12.29 Mode :character Mode :character
## Mean :11668 Mean :12.63
## 3rd Qu.:15075 3rd Qu.:15.05
## Max. :40000 Max. :27.49
## annual_inc verification_status purpose dti
## Min. : 5000 Length:40000 Length:40000 Min. : 0.00
## 1st Qu.: 42000 Class :character Class :character 1st Qu.:12.12
## Median : 57000 Mode :character Mode :character Median :17.62
## Mean : 63327 Mean :18.22
## 3rd Qu.: 77000 3rd Qu.:23.89
## Max. :400000 Max. :60.14
## open_acc revol_bal revol_util total_acc
## Min. : 1.0 Min. : 0 Min. : 0.00 Min. : 3.00
## 1st Qu.: 8.0 1st Qu.: 5606 1st Qu.: 34.70 1st Qu.:15.00
## Median :10.0 Median : 9782 Median : 52.30 Median :20.00
## Mean :10.3 Mean :11939 Mean : 52.07 Mean :21.31
## 3rd Qu.:13.0 3rd Qu.:15759 3rd Qu.: 69.70 3rd Qu.:27.00
## Max. :23.0 Max. :78762 Max. :123.20 Max. :57.00
## total_rec_int application_type tot_cur_bal total_rev_hi_lim
## Min. : 0.0 Length:40000 Min. : 0 Min. : 300
## 1st Qu.: 676.6 Class :character 1st Qu.: 25166 1st Qu.: 13000
## Median :1341.8 Mode :character Median : 53872 Median : 20800
## Mean :1817.5 Mean : 99134 Mean : 24158
## 3rd Qu.:2431.3 3rd Qu.:158901 3rd Qu.: 32100
## Max. :8826.0 Max. :467467 Max. :100000
## Status
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.1285
## 3rd Qu.:0.0000
## Max. :1.0000
While checking the summary, we can observe, that there might be some
values with insane outliers. For example annual_inc,
revol_bal or tot_cur_bal. These variables all
have a small minimum compared to the median and the max value. This
could mean that either the data is very wide spread or there are a few
outliers that have an impact on the median and the mean.
ggplot(data, aes(x = factor(Status))) +
geom_bar(fill = custom_fill, color = custom_color) +
labs(x = "Loan Status", y = "Count", title = "Distribution of Loan Status")
Checking the levels of the
loan status variable, we can
see, that the distribution is not balanced. There are much more entries
with status = 0 than status = 1.
# This code generates a histogram for each numeric variable
# Function to plot histogram for each numeric variable
plot_histograms <- function(data) {
numeric_vars <- sapply(data, is.numeric)
data_numeric <- data[, numeric_vars]
# Loop through each numeric variable
for(var in names(data_numeric)) {
print(
ggplot(data, aes_string(x = var)) +
geom_histogram(bins = 30, fill = custom_fill, color = custom_color) +
labs(x = var, y = "Frequency", title = paste("Distribution of", var)) +
theme_minimal()
)
}
}
# Apply the function to the loan dataset
plot_histograms(data)
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Function to plot box plots for each numeric variable
# The function plot_boxplots identifies all numeric variables in the dataset and creates box plots for each.
plot_boxplots <- function(data) {
numeric_vars <- sapply(data, is.numeric)
data_numeric <- data[, numeric_vars]
# Loop through each numeric variable
for(var in names(data_numeric)) {
print(
ggplot(data, aes_string(y = var)) +
geom_boxplot(fill = custom_fill, color = custom_color) +
labs(y = var, title = paste("Boxplot of", var)) +
theme_minimal()
)
}
}
# Apply the function to the loan dataset
plot_boxplots(data)
To visualize the distribution of numerical features, box plots provide a clear summary of the data distribution and highlight medians, quartiles and outliers. When split by a categorical variable such as loan status, they allow for direct comparison between groups. This is particularly useful for identifying which numerical characteristics might behave differently in the two categories of the target characteristic and provides insight into which variables might be significant predictors.
# Visualizing Numeric Features by Target Feature
# This code generates box plots for each numeric variable, separated by the target feature 'Status'.
# Function to create box plots for each numeric variable, separated by loan status
plot_numeric_by_target <- function(data, target) {
numeric_vars <- sapply(data, is.numeric)
data_numeric <- data[, numeric_vars]
for (var in names(data_numeric)) {
# Corrected usage of print() to render the plots
print(
ggplot(data, aes_string(x = target, y = var)) +
geom_boxplot(fill = custom_fill, color = custom_color) +
labs(title = paste("Distribution of", var, "by", target), x = target, y = var) +
theme_minimal()
)
}
}
# Apply the function to the loan dataset
plot_numeric_by_target(data, 'Status')
## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?
## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?
## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?
## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?
## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?
## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?
## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?
## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?
## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?
## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?
## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?
## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?
Discussion of the visualization
After running the code below, we will get box plots for each numeric
variable, broken down by loan status. These plots show the median
(middle line in the box) of each group, the dispersion (size of the
box), which shows the variability and the ossible outliers (points
outside the whiskers).
Based on the generated boxplots for a subset of numeric variables in
the loan_data dataset, in this section is our discussion of the findings
presented:
Loan Amount (loan_amnt)
The distribution of loan amounts seems relatively similar for both
default and non-default groups. However, there’s a slight indication
that higher loan amounts might be more prevalent in the default group.
This suggests that as the loan amount increases, there might be a
slightly higher risk of default.
Interest Rate (int_rate)
The interest rate shows a notable difference between the two groups.
Loans that end up in default tend to have higher interest rates. This is
evident from the higher median and the spread of the box for the default
group. Higher interest rates could be an indicator of increased risk
associated with the loan, leading to a higher likelihood of default.
Annual Income (annual_inc)
The annual income of borrowers does not show a significant difference in
the medians of the two groups, although the non-default group seems to
have a slightly higher income range. This might suggest that while
income is a factor, it’s not a strong differentiator between default and
non-default cases on its own.
Debt-to-Income Ratio (dti)
The debt-to-income ratio appears slightly higher for the default group.
This suggests that a higher DTI ratio might be associated with a greater
risk of default. The higher spread in the default group indicates a
wider variability in DTI ratios among those who default.
Revolving Balance (revol_bal) The revolving balance shows a relatively similar distribution across both groups, with a slightly higher median in the non-default group. This indicates that revolving balance alone might not be a strong predictor of default.
From this analysis, it seems that the interest rate and debt-to-income ratio are particularly relevant variables in predicting the likelihood of loan default. Higher interest rates and higher DTI ratios are more common in loans that default, suggesting these factors are important in assessing credit risk. While loan amount, annual income, and revolving balance do show some differences, their impact appears to be less pronounced compared to interest rates and DTI ratios.
The code in this section creates bar plots for each categorical variable, showing their association with the loan status (default vs. non-default).
#Visualizing Associations with Bar Plots
# Function to plot bar plots for each categorical variable
plot_categorical_associations <- function(data, target) {
categorical_vars <- sapply(data, is.factor) # or is.character, depending on how data is loaded
data_categorical <- data[, categorical_vars]
for(var in names(data_categorical)) {
# Create the plot and assign it to a variable
plot <- ggplot(data, aes_string(x = var)) +
geom_bar(position = "fill", fill = custom_fill, color = custom_color) +
labs(title = paste("Association between", var, "and", target), x = var, y = "Proportion") +
theme_minimal() +
scale_fill_brewer(palette = "Set1") +
guides(fill = guide_legend(title = target))
# Print the plot
print(plot)
}
}
# Apply the function to the loan dataset
plot_categorical_associations(data, 'Status')
From these plots, we can observe the proportion of default and non-default cases within each category of the categorical variables. The significant differences in the proportion of default vs. non-default cases across the categories of a variable suggest a potential association. For example, if a certain loan purpose or grade shows a noticeably higher proportion of defaults, it may indicate that these variables are relevant in predicting loan defaults.
The following points describe our findings:
Loan Grade (grade):
There is a noticeable trend in loan defaults across different loan
grades. Lower grades (typically representing higher risk) show a higher
proportion of defaults. This suggests that the loan grade is a
significant factor in predicting loan defaults.
Home Ownership (home_ownership)
The distribution of loan status varies across different home ownership
categories. However, the variation is not as pronounced as with loan
grades. It appears that borrowers with mortgages have a slightly higher
count of non-defaults compared to other categories.
Verification Status (verification_status)
The verification status of the borrower also shows an interesting
pattern. Loans with verified statuses tend to have a lower proportion of
defaults compared to those that are not verified or are source verified.
This implies that verification status might play a role in predicting
loan defaults.
Purpose of the Loan (purpose)
The purpose of the loan demonstrates varying patterns of defaults.
Certain purposes, like debt consolidation, show a higher number of
defaults. This variation suggests that the reason for taking out a loan
could be an indicator of default risk.
Application Type (application_type)
The application type, whether individual or joint, shows some
differences in default rates. Individual applications seem to have a
slightly higher proportion of defaults compared to joint applications,
indicating that the type of application might be a factor in default
risk.
From these observations, it’s evident that certain categorical variables such as loan grade and loan purpose show a more pronounced association with loan status, indicating their potential relevance in predicting loan defaults. Variables like home ownership and application type also show some association but to a lesser degree. The verification status, interestingly, has a noticeable impact, suggesting that the level of scrutiny in the loan approval process can affect the likelihood of a loan defaulting. These insights are valuable for understanding the risk factors associated with loan defaults and can guide further analysis and modeling efforts.
The heatmap of the correlation matrix for the numeric features in the loan_data dataset reveals the following insights:
# The code first selects only numeric variables from our dataset. It then calculates the correlation matrix using Pearson correlation. The melt function from the reshape2 package is used to transform the correlation matrix into a long format suitable for ggplot. Finally, a heatmap is plotted using ggplot2 with appropriate color scales to represent the strength and direction of correlations.
# Selecting only the numeric variables from the dataset
numeric_vars <- data[sapply(data, is.numeric)]
# Calculate the correlation matrix
correlation_matrix <- cor(numeric_vars, use = "complete.obs")
# Melt the correlation matrix for visualization
melted_correlation_matrix <- melt(correlation_matrix)
# Plotting the correlation matrix
ggplot(melted_correlation_matrix, aes(Var1, Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = custom_light, high = custom_dark, mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 12, hjust = 1),
axis.text.y = element_text(size = 12)) +
labs(x = '', y = '', title = 'Correlation Matrix of Numeric Features')
The scatter plot showing the relationship between the loan amount requested and the annual income of the borrowers reveals the following insights:
# We plot an interactive scatter plot showing the association between the loan amount requested and the annual income of the borrower
# Creating an interactive scatter plot
plot <- plot_ly(data, x = ~annual_inc, y = ~loan_amnt, type = "scatter", mode = "markers",
marker = list(size = 10, opacity = 0.5, color = custom_fill),
hoverinfo = "text",
text = ~paste("Annual Income:", annual_inc, "<br>Loan Amount:", loan_amnt))
# Adding layout details
plot <- plot %>% layout(title = "Scatter Plot of Loan Amount vs Annual Income",
xaxis = list(title = "Annual Income"),
yaxis = list(title = "Loan Amount"))
# Display the plot
plot
Association Between Loan Amount and Annual
Income
Positive Correlation: There appears to be a positive correlation between
the loan amount and the annual income. As the annual income increases,
the loan amounts also tend to increase. This suggests that borrowers
with higher incomes are likely to request larger loans.
Variability in Loan Amounts
Despite the positive trend, there is considerable variability in the
loan amounts across different income levels. This indicates that factors
other than income also play a significant role in determining the loan
amount.
Income Range
The plot shows a wide range of incomes among borrowers. Those with lower
incomes tend to request smaller loans, while those with higher incomes
have a broader range of loan amounts.
Outliers and Spread
There are some outliers, particularly at higher income levels, where
borrowers with very high incomes request relatively small loan amounts.
Additionally, the spread of loan amounts widens with increasing income,
suggesting more variability in the loan amounts requested by
higher-income borrowers.
Conclusion and Decision on Variables
High Correlation: Given the positive correlation, both annual_inc
(annual income) and loan_amnt (loan amount) are important variables for
predicting loan behavior and should be considered in the predictive
model.
Keeping Both Variables: Although they are correlated, both variables likely contribute unique information. Annual income is a key factor in a borrower’s ability to repay, while the loan amount is directly related to the loan’s terms and risk. Therefore, it would be beneficial to keep both variables in the model.
In summary, the association between loan amount and annual income is significant and positive, indicating the importance of these variables in the predictive model. Their relationship suggests that while income is a factor in determining loan amounts, other variables also influence the final loan amount requested.
In this step we creating a balanced dataset where the two levels of the target variable (‘Status’) are equally represented, and then we visualize this with a bar plot.
# In this last step we build a balanced dataset
# Check the balance of the original dataset
table(data$Status)
##
## 0 1
## 34861 5139
# Create a balanced dataset
# First, separate the dataset into two based on the target variable
data_default <- data %>% filter(Status == 1)
data_no_default <- data %>% filter(Status == 0)
# Then, randomly sample from the larger group to match the size of the smaller group
set.seed(7) # for reproducibility
data_no_default_balanced <- data_no_default %>% sample_n(nrow(data_default))
# Combine the two balanced datasets
balanced_data <- rbind(data_default, data_no_default_balanced)
# Create a bar plot of the newly created target variable
ggplot(balanced_data, aes(x = factor(Status))) +
geom_bar(fill = custom_fill, color = custom_color) +
labs(title = "Bar Plot of Balanced Target Variable", x = "Loan Status", y = "Count")
# Check the balance of the new dataset
table(balanced_data$Status)
##
## 0 1
## 5139 5139
Balancing the Dataset is important for the following different points:
Avoid Bias in Models
When a dataset is imbalanced, especially in a binary classification
problem, models can become biased towards the majority class. This can
lead to poor generalization performance on the minority class.
Better Performance Metrics
Balanced datasets provide a more realistic evaluation of model
performance metrics, especially those sensitive to class imbalance like
accuracy, precision, and recall.
Reflect Real-World Scenarios
In some cases, balancing a dataset can help models learn patterns that
are not overrepresented by the majority class, thus better reflecting
real-world scenarios where the target classes are more evenly
distributed.